Machine Learning in Finance
Module 3: Introduction
Introduction to AI
AI, ML, DL
Artificial Intelligence
Artificial Intelligence (AI): intelligent machines
Those machines are called intelligent because:
They have decision-making capabilities
like human beings (i.e., “smart”)
Programming and ML
Machine Learning
Machine Learning (ML): A subset of AI
- A technology that allows computer programs to learn patterns and rules by improving through experience
Algorithms enable the computer to learn
Some come from statistics, others from computer science
Neural networks (deep learning) are one such algorithm
Programming and ML
Types of ML
Types of ML
- Unsupervised Learning
- cluster, associate, reduce dimensions
- (Semi) Supervised Learning
- predict, forecast, classify
- Reinforcement Learning
- Multi-stage decision making
Unsupervised Learning
Unsupervised Learning
Find patterns in \(X\) and “cluster” observations that share similar patterns.
Principal Component Analysis (PCA)
- Shrinks \(X\), a.k.a. dimensionality reduction
Hierarchical / K-means clustering
- Cluster observations into K groups
Latent Dirichlet Allocation (LDA)
- Topic modeling
Neural Network (e.g., autoencoder)
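As a minimal sketch of one of the algorithms above, the following uses base R's `kmeans` on synthetic two-dimensional data (not the course dataset) to recover two groups:

```r
# K-means on synthetic 2-D data: two well-separated blobs
set.seed(42)
x1 <- c(rnorm(50, mean = 0), rnorm(50, mean = 5))
x2 <- c(rnorm(50, mean = 0), rnorm(50, mean = 5))
X  <- data.frame(x1, x2)

km <- kmeans(X, centers = 2, nstart = 10)  # K = 2 groups
table(km$cluster)                          # cluster sizes
km$centers                                 # estimated group centers
```

Note there is no Y anywhere: the algorithm groups rows purely by similarity in X.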
Use of Unsupervised Learning
Unsupervised learning is useful for
Descriptive purpose (patterns)
Garnering insights from data
Dimensionality reduction (feature selection)
Unsupervised Learning in Finance:
Stock (Fund) clustering for portfolio
- Cluster stocks that co-move
Clustering
- Factor zoo: CAPM / Fama-French / Carhart / etc.
- Country risk: legal, inflation, peace, etc.
Transaction Anomaly detection
Sentiment analysis (topic discovery)
Preprocessing step
Wordcloud
How do I know if it is Unsupervised?
Ask yourself whether the data has a ground-truth variable (Y) in addition to the predictor variables (X).
Group investors by the trading patterns of each brokerage account holder
Cluster credit card transactions into pattern types by time, location, amount, and frequency
Unsupervised Learning Types
Unsupervised learning problems can be grouped into three:
Clustering:
- Row-wise groupings based on the predictor variables
- Rule: use Xs to form a grouping variable
Association:
- Find frequent combination of categorical variables
- Item-wise groupings such as associating keywords to text
- Rule: Xs to X (co-occurrence)
Dimensionality reduction
- Reduce variables to remove redundancy
- Preprocessing step
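The dimensionality-reduction case can be sketched with PCA via base R's `prcomp`, here on the built-in `mtcars` data as a stand-in for a real feature matrix:

```r
# PCA as a dimensionality-reduction / preprocessing step
data(mtcars)
pc <- prcomp(mtcars, scale. = TRUE)  # standardize, then rotate

summary(pc)        # proportion of variance explained per component
head(pc$x[, 1:2])  # keep only the first two components as new features
```

Often the first few components capture most of the variance, so downstream models can use `pc$x[, 1:2]` instead of all original columns.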
Supervised Learning
Supervised Learning
Supervised learning method’s objective is:
For truth value Y, find the best predictor model f given data X for
\[ Y \sim f(X) \]
- Therefore, supervised learning requires Y!
Supervised Learning algorithms
Linear regressions
Logistic regressions (categorical)
Decision Trees
Boosted Trees
Neural network
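As a minimal sketch of the first two algorithms above, on synthetic data (the variable names are illustrative, not from the course dataset):

```r
# Linear regression (continuous Y) and logistic regression (categorical Y)
set.seed(1)
n <- 200
x      <- rnorm(n)
y_cont <- 2 + 3 * x + rnorm(n)            # continuous target
y_bin  <- rbinom(n, 1, plogis(1 + 2 * x)) # binary target

fit_lm  <- lm(y_cont ~ x)                    # regression
fit_glm <- glm(y_bin ~ x, family = binomial) # classification
coef(fit_lm)                                  # intercept and slope estimates
head(predict(fit_glm, type = "response"))     # predicted probabilities
```

In both cases the model is fit against a known Y, which is exactly what makes these supervised methods.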
Regression and Classification
Supervised learning can be grouped into regression and classification problems.
When the predicted variable, Y, is:
Continuous variable: Regression
- Stock return, Option price, volatility, earnings forecast (EPS), …
Discrete variable: Classification
- Bank failure, Boom/bust, Positive/Negative, etc…
Quiz
Supervised / Unsupervised?
Regression / Classification, or Clustering / Association?
Model Validation
Train-test split (2 part)
Simplest splitting scheme.
We need data to train/fit the parameters of the specified model.
- Also called “building a model”
Then we validate the model's performance on “unseen” data.
- Typically a 70/30 or 80/20 split
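A minimal sketch of an 80/20 split in R; `df` here is a small synthetic Age/Salary data frame standing in for the course data:

```r
# 80/20 train-test split on a hypothetical Age/Salary data frame
set.seed(123)
df <- data.frame(Age = runif(100, 20, 65))
df$Salary <- 50000 + 3000 * df$Age + rnorm(100, sd = 20000)

idx   <- sample(nrow(df), size = 0.8 * nrow(df))  # random 80% of row indices
train <- df[idx, ]   # used to fit the model
test  <- df[-idx, ]  # held out to evaluate on "unseen" data
nrow(train)  # 80
nrow(test)   # 20
```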
Train-valid-test split (3 part)
ML operations involve the hyperparameter tuning process.
- Hyperparameters are parameters in the configuration of the model that are NOT LEARNED during training but set prior to the training process. They govern the model's performance and the training time.
Why an additional split?
Without validation data, the test results can be biased toward a specific hyperparameter setting.
Set aside one more slice of the data for a further validation step
Typically 70/15/15
Pick the model with the best performance on the validation set, then run a final check on the test set
Cross validation (CV)
Instead of a 3-part split (train-valid-test), a more robust technique used to assess ML performance.
k-fold CV
K-fold CV
Divide the training data into k equally sized folds
Each fold is used once as the validation set
Average the performance for each hyperparameter setting
Pick the best model
Run a final check on the test set (this hold-out set is optional)
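The steps above can be sketched by hand in base R: a 5-fold CV loop for a simple linear model, scored by mean squared error on synthetic data (illustrative, not the course dataset):

```r
# 5-fold cross-validation for lm(Salary ~ Age), scored by MSE
set.seed(123)
df <- data.frame(Age = runif(100, 20, 65))
df$Salary <- 50000 + 3000 * df$Age + rnorm(100, sd = 20000)

k     <- 5
folds <- sample(rep(1:k, length.out = nrow(df)))  # assign rows to folds
mse   <- numeric(k)
for (i in 1:k) {
  fit    <- lm(Salary ~ Age, data = df[folds != i, ])  # train on k-1 folds
  pred   <- predict(fit, newdata = df[folds == i, ])   # validate on fold i
  mse[i] <- mean((df$Salary[folds == i] - pred)^2)
}
mean(mse)  # average validation error across folds
```

To tune hyperparameters, this loop would be repeated for each candidate setting and the setting with the lowest average error picked.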
ML Performance
Accuracy of the model
How can we determine the “quality” of ML model?
What makes one ML model “better”?
Bias / Variance
Note
Unsupervised algorithms have no ground TRUTH against which to measure accuracy. Therefore, the bias/variance tradeoff discussion applies only to supervised algorithms.
Two properties of accuracy
- Unbiasedness (less bias or error)
- Consistency (less variance)
Bias
Bias is the prediction error of a model,
especially the prediction error on the training set.
Regression: \(\frac{1}{n}\sum\limits_{i = 1}^{n} (f(X_i) -Y_i)^2\)
Classification: \(\frac{1}{n}\sum\limits_{i = 1}^{n} 1_{f(X_i) \neq Y_i}\)
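The two formulas above can be computed directly; a small R sketch with made-up toy values:

```r
# Regression bias: mean squared error on the training set
y_true <- c(1.0, 2.0, 3.0, 4.0)
y_pred <- c(1.1, 1.9, 3.2, 3.8)
mse <- mean((y_pred - y_true)^2)  # (0.01 + 0.01 + 0.04 + 0.04) / 4 = 0.025

# Classification bias: misclassification rate (indicator average)
cls_true <- c("up", "down", "up", "up")
cls_pred <- c("up", "up",   "up", "down")
err_rate <- mean(cls_pred != cls_true)  # 2 mismatches out of 4 = 0.5
```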
Low bias: predicted data points are close to the target
High bias: predicted data points are far from the target.
Bias in Regression
Based on the training set, we have three linear models:
- Linear model: high bias, strong linearity assumption
- Quadratic model: lower bias, fits the data fairly well
- Higher-order polynomial: lowest bias, fits the training data extremely well
Model complexity and accuracy
In general,
Higher complexity, higher training accuracy
Higher complexity, lower interpretability (more “black-box”)
Low bias, is that all?
If we attain low bias (high accuracy) from our train data in our model:
- Should it work well with outside (new) data?
We care about the model’s generalizability.
- The model is of no use if it only works well on the training data!
Therefore:
We must test the model accuracy with new data
that is not used for training the model
Variance
Variance quantifies the sensitivity of the parameter estimates to fluctuations in the data.
It indicates how “reliable” the model is out of sample.
High variance: the model does not generalize well
Low variance: the model performance is reliable when outside data is given
Bias / Variance Tradeoff
Example : model1
- Linear model: low variance; the slope (parameter estimate) and model accuracy do not change much
Underfitting problem
It did not learn enough!
If we use trained model for prediction on new dataset:
- Not impressive test accuracy
Example: model2
- Quadratic model: moderate variance, a moderate change in slope.
Example: model3
- Higher-order polynomial model: high variance as the parameters change a lot with new data
Overfitting problem
It learned too much noise!
If we use trained model for prediction on new dataset:
A large drop in test accuracy compared to training
Classic overfitting problem
Not reliable on unseen data
Balanced model
Achieves good balance
If we use trained model for prediction on new dataset:
- Less change in accuracy, similar to train accuracy
In a nutshell
Class exercise
Housekeeping
Plot them
# Plot fitted line on top of train set
base_plot <- train |>
ggplot(aes(x = Age, y = Salary)) +
geom_point() +
theme_bw()
plot_linear <- base_plot +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
plot_quad <- base_plot +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE)
plot_poly <- base_plot +
geom_smooth(method = "lm", formula = y ~ poly(x, 5), se = FALSE)
plot_linear
plot_quad
plot_poly
Fit the model, compare accuracy
What is the train accuracy (using \(R^2\) as metric)?
Test them
# A tibble: 10 × 5
Age Salary Predicted_sal_1 Predicted_sal_2 Predicted_sal_3
<dbl> <dbl> <dbl> <dbl> <dbl>
1 30 166000 165980. 160184. 117293.
2 26 78000 150670. 113172. 111072.
3 58 310000 273144. 271282. 245230.
4 29 100000 162152. 149161. 104937.
5 40 260000 204253. 243654. 284898.
6 27 150000 154498. 125655. 99725.
7 33 140000 177461. 190334. 173238.
8 61 220000 284626. 260559. 257932.
9 27 86000 154498. 125655. 99725.
10 48 276000 234871. 275396. 277161.
Calculate R-squared
\[ R^2 = 1- \frac{\sum(y_i - \hat{y_i})^2}{\sum(y_i - \bar{y})^2} = 1 - \frac{SSR}{SST} \]
Calculate R2
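The formula above translates into a small R helper (the function name is illustrative):

```r
# R-squared = 1 - SSR/SST
r_squared <- function(y, y_hat) {
  ssr <- sum((y - y_hat)^2)      # sum of squared residuals
  sst <- sum((y - mean(y))^2)    # total sum of squares
  1 - ssr / sst
}

# Perfect prediction gives R^2 = 1
r_squared(c(1, 2, 3), c(1, 2, 3))  # 1
```

Applied to the table above, `y` would be the `Salary` column and `y_hat` one of the `Predicted_sal_*` columns, giving one train (or test) accuracy per model.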
Plotting fitted values on test
base_plot <- test |>
ggplot(aes(x = Age)) +
theme_bw() +
geom_point(aes(y = Salary), color = "black")
model1_plot <- base_plot +
geom_point(aes(y = Predicted_sal_1), color = "blue4") +
geom_smooth(
aes(y = Predicted_sal_1),
method = "lm",
formula = y ~ x,
color = "blue4",
se = FALSE
)
model2_plot <- base_plot +
geom_point(aes(y = Predicted_sal_2), color = "green4") +
geom_smooth(
aes(y = Predicted_sal_2),
method = "lm",
formula = y ~ poly(x, 2),
color = "green4",
se = FALSE
)
model3_plot <- base_plot +
geom_point(aes(y = Predicted_sal_3), color = "red4") +
geom_smooth(
aes(y = Predicted_sal_3),
method = "lm",
formula = y ~ poly(x, 5),
color = "red4",
se = FALSE
)
model1_plot
Lab problem
Using the train / test data above, fit and compare two models: a 3rd-order and a 4th-order polynomial regression:
\[ Salary = \beta_0 + \beta_1 Age + \beta_2 Age^2 + \beta_3 Age^3 + \epsilon \]
\[ Salary = \beta_0 + \beta_1 Age + \beta_2 Age^2 + \beta_3 Age^3 + \beta_4 Age^4 + \epsilon \]
Report train accuracy (\(R^2\)) and test accuracy. Which model performs better?
Visualize your work, and use Quarto to render an .html report.
Homework Reading
Reading
John C. Hull “Machine Learning in Business”
- Chapter 1